biomedical dataset
Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models
Ge, Yao, Das, Sudeshna, Guo, Yuting, Sarker, Abeed
Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
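The retrieval step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the prompt wording, and the function names (`select_examples`, `build_prompt`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_examples(query, pool, k):
    """Pick the k annotated examples most similar to the query text.

    pool is a list of (text, annotation) pairs.
    """
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    order = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in order[:k]]


def build_prompt(query, pool, k=2):
    """Rebuild the few-shot prompt for each input (hypothetical wording)."""
    parts = ["Extract the biomedical entities from the input text."]
    for text, annotation in select_examples(query, pool, k):
        parts.append(f"Input: {text}\nEntities: {annotation}")
    parts.append(f"Input: {query}\nEntities:")
    return "\n\n".join(parts)
```

Because the examples are re-selected per input, two different query sentences generally receive different prompts, which is the contrast with static prompting that the abstract draws.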
Charlotte Bunne on developing AI-based diagnostic tools
Charlotte Bunne, head of EPFL's Artificial Intelligence in Molecular Medicine Group, is developing AI algorithms to better understand the incredibly complex and high-dimensional data that represent the hundreds of tissue layers and protein markers in an individual cell. EPFL magazine Dimensions spoke to Charlotte Bunne about her work at the cutting edge of AI in medicine and biology. Could you describe the focus of your research? We are developing diagnostic tools for clinics that are driven by AI technologies. This includes forecasting the best treatment for a patient, trying to understand the state of disease that a patient is in, and deciphering important biomarkers or potential drug targets that we should investigate further.
Learning Ordinality in Semantic Segmentation
Cristino, Rafael, Cruz, Ricardo P. M., Cardoso, Jaime S.
Semantic segmentation consists of predicting a semantic label for each image pixel. Conventional deep learning models do not take advantage of ordinal relations that might exist in the domain at hand. For example, it is known that the pupil is inside the iris, and the lane markings are inside the road. Such domain knowledge can be employed as constraints to make the model more robust. The current literature on this topic has explored pixel-wise ordinal segmentation methods, which treat each pixel as an independent observation and promote ordinality in its representation. This paper proposes novel spatial ordinal segmentation methods, which take advantage of the structured image space by considering each pixel as an observation dependent on its neighborhood context to also promote ordinal spatial consistency. When evaluated with five biomedical datasets and multiple configurations of autonomous driving datasets, ordinal methods resulted in more ordinally-consistent models, with substantial improvements in ordinal metrics and some increase in the Dice coefficient. It was also shown that the incorporation of ordinal consistency results in models with better generalization abilities.
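To make the two kinds of measurement in this abstract concrete, the sketch below computes a per-class Dice coefficient and a simple spatial check on ordinal labels. The spatial check is an assumption made for illustration and is not the paper's metric: it treats labels as nested levels (for example 0 = background, 1 = iris, 2 = pupil) and counts neighboring pixel pairs whose labels differ by more than one level. The name `neighbor_jump_rate` is hypothetical.

```python
def dice(pred, target, label):
    """Dice coefficient for one class over two 2D label masks (lists of rows)."""
    pred_count = sum(1 for row in pred for v in row if v == label)
    target_count = sum(1 for row in target for v in row if v == label)
    overlap = sum(
        1
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
        if p == label and t == label
    )
    if pred_count + target_count == 0:
        return 1.0  # class absent from both masks: treated as perfect agreement
    return 2.0 * overlap / (pred_count + target_count)


def neighbor_jump_rate(mask):
    """Fraction of 4-connected neighbor pairs whose labels differ by more than 1.

    Illustrative assumption: with nested classes, a pixel of level 2 should
    not touch a pixel of level 0 directly, so such pairs count as violations.
    """
    height, width = len(mask), len(mask[0])
    pairs = 0
    jumps = 0
    for y in range(height):
        for x in range(width):
            # Look only down and right so each pair is counted once.
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < height and nx < width:
                    pairs += 1
                    if abs(mask[y][x] - mask[ny][nx]) > 1:
                        jumps += 1
    return jumps / pairs if pairs else 0.0
```

A mask in which level 2 is fully surrounded by level 1 scores zero on the second measure, while a level 2 pixel placed directly in the background does not.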
Augmenting Biomedical Named Entity Recognition with General-domain Resources
Yin, Yu, Kim, Hyunjae, Xiao, Xiao, Wei, Chih Hsuan, Kang, Jaewoo, Lu, Zhiyong, Xu, Hua, Fang, Meng, Chen, Qingyu
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. In this paper, we proposed GERBERA, a simple-yet-effective method that utilized a general-domain NER dataset for training. Specifically, we performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with multiple additional BioNER datasets. Specifically, our models consistently outperformed the baselines in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight biomedical entity types sourced from five different corpora. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement
Dong, Hang, Chen, Jiaoyan, He, Yuan, Horrocks, Ian
Mentions of new concepts appear regularly in texts and require automated approaches to harvest and place them into Knowledge Bases (KB), e.g., ontologies and taxonomies. Existing datasets suffer from three issues: (i) mostly assuming that a new concept is pre-discovered, so they cannot support out-of-KB mention discovery; (ii) only using the concept label as the input along with the KB, and thus lacking the contexts of a concept label; and (iii) mostly focusing on concept placement w.r.t. a taxonomy of atomic concepts, instead of complex concepts, i.e., those with logical operators. To address these issues, we propose a new benchmark, adapting the MedMentions dataset (PubMed abstracts) with SNOMED CT versions in 2014 and 2017 under the Diseases sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product. We show how to use the dataset to evaluate out-of-KB mention discovery and concept placement, adapting recent Large Language Model based methods.
Machine Learning-Friendly Biomedical Datasets for Equivalence and Subsumption Ontology Matching
He, Yuan, Chen, Jiaoyan, Dong, Hang, Jiménez-Ruiz, Ernesto, Hadian, Ali, Horrocks, Ian
Ontology Matching (OM) plays an important role in many domains such as bioinformatics and the Semantic Web, and its research is becoming increasingly popular, especially with the application of machine learning (ML) techniques. Although the Ontology Alignment Evaluation Initiative (OAEI) represents an impressive effort for the systematic evaluation of OM systems, it still suffers from several limitations including limited evaluation of subsumption mappings, suboptimal reference mappings, and limited support for the evaluation of ML-based systems. To tackle these limitations, we introduce five new biomedical OM tasks involving ontologies extracted from Mondo and UMLS. Each task includes both equivalence and subsumption matching; the quality of reference mappings is ensured by human curation, ontology pruning, etc.; and a comprehensive evaluation framework is proposed to measure OM performance from various perspectives for both ML-based and non-ML-based OM systems. We report evaluation results for OM systems of different types to demonstrate the usage of these resources, all of which are publicly available as part of the new Bio-ML track at OAEI 2022.
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
Lin, Weixiong, Zhao, Ziheng, Zhang, Xiaoman, Wu, Chaoyi, Zhang, Ya, Wang, Yanfeng, Xie, Weidi
Foundation models trained on large-scale datasets have seen a recent surge in CV and NLP. In contrast, development in the biomedical domain lags far behind due to data scarcity. To address this issue, we build and release PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMedCentral's OpenAccess subset, which is 8 times larger than before. PMC-OA covers diverse modalities and diseases, with the majority of the image-caption samples aligned at a finer-grained level, i.e., subfigure and subcaption. By pretraining a CLIP-style model on PMC-OA, our model, named PMC-CLIP, achieves state-of-the-art results on various downstream tasks, including image-text retrieval on ROCO, MedMNIST image classification, and Medical VQA, e.g., +8.1% R@10 on image-text retrieval and +3.9% accuracy on image classification.
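The R@10 figure quoted above is a recall-at-k retrieval score. A minimal sketch of how such a score is commonly computed is given below, assuming a square similarity matrix in which the correct caption for image i sits at column i. The matrix layout and the name `recall_at_k` are assumptions for illustration, not details taken from the paper.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j, and
    candidate i is assumed to be the correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```

Raising k can only keep or increase the score, which is why results are usually reported at several cutoffs.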
A Novel Weighted Combination Method for Feature Selection using Fuzzy Sets
Shen, Zixiao, Chen, Xin, Garibaldi, Jonathan M.
In this paper, we propose a novel weighted combination feature selection method using bootstrap and fuzzy sets. The proposed method mainly consists of three processes, including fuzzy sets generation using bootstrap, weighted combination of fuzzy sets and feature ranking based on defuzzification. We implemented the proposed method by combining four state-of-the-art feature selection methods and evaluated the performance based on three publicly available biomedical datasets using five-fold cross validation. Based on the feature selection results, our proposed method produced comparable (if not better) classification accuracies to the best of the individual feature selection methods for all evaluated datasets. More importantly, we also applied standard deviation and Pearson's correlation to measure the stability of the methods. Remarkably, our combination method achieved significantly higher stability than the four individual methods when variations and size reductions were introduced to the datasets.
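The abstract mentions Pearson's correlation as one of its stability measures. One plausible way to turn that into a single number is sketched below: score the features on several perturbed versions of a dataset, then average the pairwise Pearson correlation between the resulting score vectors. The averaging scheme and the name `stability` are assumptions for illustration; the paper's exact protocol is not given in this abstract.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def stability(score_runs):
    """Mean pairwise Pearson correlation across feature-score vectors.

    score_runs holds one list of per-feature scores for each perturbed
    dataset (assumed protocol); at least two runs are required.
    """
    pairs = list(combinations(score_runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```

Values near 1 indicate that the method ranks features similarly across perturbations, and values near 0 or below indicate that the ranking changes substantially.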